Revisiting Gaussian Process Regression of the Mauna Loa Dataset using Automated Kernel Engineering

Alessio Tamburro


The source code for this article is available here

Data and more info on the subject here

Note: the html version of this notebook was created with nbconvert and running the command jupyter nbconvert --to html --TagRemovePreprocessor.remove_input_tags="notvisible" notebook.ipynb. A tag "notvisible" was added to the cells that are not displayed in this rendered html

In this study, we want to forecast the monthly average atmospheric $CO_2$ concentration, in parts per million by volume (ppm), collected at the Mauna Loa Observatory in Hawaii. The dataset used for this study spans the period between 1958 and 2001. The goal is to train a model of the $CO_2$ concentration as a function of time on the data collected until 2001 and extrapolate for the following years until today. A similar study is available in the book by Carl E. Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press 2006, and has been demonstrated with the scikit-learn Gaussian process regression module gaussian_process.

Previous Study

We will start by reproducing the exercise of Rasmussen and Williams following the scikit-learn notebook code mentioned earlier. The data is publicly available through the OpenML project.

We take the monthly average and drop the months for which no measurements were collected.
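The monthly averaging step can be sketched with pandas; the values below are synthetic stand-ins for the raw measurements:

```python
import numpy as np
import pandas as pd

# Synthetic weekly CO2-like readings, for illustration only.
rng = np.random.default_rng(0)
dates = pd.date_range("1958-03-29", periods=20, freq="7D")
co2 = pd.Series(315.0 + 0.02 * np.arange(20) + rng.normal(0, 0.2, 20), index=dates)

co2 = co2[co2.index.month != 5]               # pretend May has no measurements
monthly = co2.resample("MS").mean().dropna()  # monthly average, empty months dropped
print(monthly.index.month.tolist())           # May (month 5) is gone
```

`resample("MS")` groups by calendar month; `dropna()` removes the months for which no measurements were collected.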

Looking at the data, we can make some assumptions regarding the function describing the measurements. We can observe a long-term rising trend, a seasonal variation and smaller irregularities. These assumptions will help us manually engineer the kernel used to fit a Gaussian process regression model.
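Guided by these assumptions, a hand-engineered kernel similar to the one in the scikit-learn example can be composed as follows; the initial parameter values are illustrative starting points, not fitted results:

```python
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel)

# Long-term rising trend: a smooth RBF with a large length scale.
long_term = 50.0**2 * RBF(length_scale=50.0)
# Seasonal variation: a periodic kernel times an RBF that lets the
# seasonal shape drift slowly over the years.
seasonal = 2.0**2 * RBF(length_scale=100.0) * ExpSineSquared(
    length_scale=1.0, periodicity=1.0)
# Smaller medium-term irregularities.
irregularities = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0)
# Short-scale variations and observation noise.
noise = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.1**2)

kernel = long_term + seasonal + irregularities + noise
print(kernel.n_dims)  # number of kernel hyperparameters to optimize
```

Multiplying a kernel by a float wraps it in a `ConstantKernel` product, so every amplitude above is itself an optimizable hyperparameter.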

When we fit the model using the given kernel, the parameter values of the composed kernel are optimized by maximizing the log marginal likelihood

$\log p(y|X) = -\frac{1}{2}(y - m(X))^T(K+\sigma^2_n I)^{-1}(y - m(X)) - \frac{1}{2}\log|K+\sigma^2_n I|-\frac{n}{2}\log2\pi$,

where $m: \mathbb{R}^d \rightarrow \mathbb{R}$ is the mean function of a Gaussian process and the notation $m(X)$ represents the vector obtained by applying the mean function to each of the $n$ points in $X$; $y$ is the target, $K$ is the prior covariance (more on this a little later in this notebook), $\sigma^2_n$ is the noise variance, and $\log|\cdot|$ indicates the logarithm of the determinant. More on this topic can be found in the book by Carl E. Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press 2006 (eq. 2.30).
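As a sanity check, the expression above can be evaluated directly and compared to the log-density of the corresponding multivariate normal; the sketch below assumes a zero mean function and an arbitrary RBF prior:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(25, 1))
y = np.sin(X).ravel()

K = RBF(length_scale=1.0)(X)          # prior covariance K
sigma2_n = 0.1                        # noise variance sigma_n^2
n = len(X)
C = K + sigma2_n * np.eye(n)          # K + sigma_n^2 I

# log p(y|X) with zero mean function, m(X) = 0
lml = (-0.5 * y @ np.linalg.solve(C, y)
       - 0.5 * np.linalg.slogdet(C)[1]
       - 0.5 * n * np.log(2 * np.pi))

# the same quantity as a multivariate normal log-density
ref = multivariate_normal(mean=np.zeros(n), cov=C).logpdf(y)
print(np.isclose(lml, ref))  # True
```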

The optimized kernel components are the following:

We will now use the model to extrapolate to future dates. To accomplish this, we create synthetic data up to the current month.
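For illustration, the synthetic grid can be built as fractional years from the end of the training period up to the current month (the exact time encoding used in the notebook may differ):

```python
import numpy as np
import pandas as pd

today = pd.Timestamp.today()
future = pd.date_range("2001-12-01", today, freq="MS")  # month starts
# convert timestamps to fractional years, e.g. 2002.0, 2002.0833, ...
X_future = (future.year + (future.month - 1) / 12).to_numpy().reshape(-1, 1)
print(X_future[:3].ravel())
```

The column vector shape matches what scikit-learn's `predict` expects for a single input feature.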

Automated Kernel Engineering

In the previous exercise, we created the kernel for our Gaussian process regression model based on human intuition and interpretation of the underlying data generation process. Another approach is to search for an appropriate kernel structural form by composing different base kernels. A paper about compositional kernel search is available here. The reader can also learn about kernel composition here. In fact, we can search a space of kernel structures, which are built by adding and multiplying a set of base kernels. The search can be challenging, and we will propose a method that allows us to optimally search the space of possible kernels.

Using sums of base kernels, we can model the data as a superposition of independent generating functions. This allows for representing different structures that are present in the data. Using products of kernels allows for taking into consideration possible interactions between different input variables.

Selecting a kernel when fitting a Gaussian process regression model means assuming a function that defines the "similarity" of pairs of data points. Data points that are similar or close together are assumed to have similar target values during the fitting process. There are two main categories of kernels:

- stationary kernels, which depend only on the distance between two data points and not on their absolute values;
- non-stationary kernels, which depend on the specific values of the data points themselves.

Stationary kernels can also be further classified as:

- isotropic kernels, which depend only on the distance $d(x_i, x_j)$ and use a single length scale for all input dimensions;
- anisotropic kernels, which use a separate length scale for each input dimension.

A list of available base kernels in scikit-learn can be found here. To interact with kernel parameter values, the reader can look at this page.

Brief Discussion on Base Kernels

An example of a stationary, isotropic base kernel is the Radial Basis Function (RBF) kernel

$k(x_i, x_j) = \text{exp}\left(- \frac{d(x_i, x_j)^2}{2l^2} \right)$,

where $l$ is called the length scale and $d$ is the Euclidean distance between pairs of points. We can use the Mauna Loa input variable to illustrate the long-term rising component, which uses an RBF kernel.
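We can verify the formula numerically against scikit-learn's RBF implementation (the points and length scale below are arbitrary):

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

x_i, x_j = np.array([[1.0]]), np.array([[3.0]])
l = 2.0

# k(x_i, x_j) = exp(-d^2 / (2 l^2)) computed by hand
manual = np.exp(-np.linalg.norm(x_i - x_j) ** 2 / (2 * l**2))
# the same value from the library kernel
sklearn_value = RBF(length_scale=l)(x_i, x_j)[0, 0]
print(np.isclose(manual, sklearn_value))  # True
```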

The contour plot above displays the kernel values of the long-term rising component. The x- and y-axis are the indices associated with the input values. Given two data points, the length scale determines the distance over which one data point is still expected to influence the other. The constant that multiplies the RBF determines the magnitude of this influence (z-axis scale). Consider, for example, the two highlighted black points at (50, 450) and (300, 400). The mutual influence of the data points at indexes 50 and 450 is less than the influence observed for the data points at indexes 300 and 400. In fact, the influence that each data point has on the other in a given pair fades with a length scale of ~52.

The kernel function describing the irregularities of the generating process is the RationalQuadratic:

$k(x_i, x_j) = \left(1 + \frac{d(x_i, x_j)^2}{2\alpha l^2}\right)^{-\alpha}$.

This can be interpreted as an infinite sum (a scale mixture) of RBF kernels having different length scales. The scale mixture parameter $\alpha$ determines the relative weighting of the RBF components with different length scales.

Changing the length scale here will have the same effect described for the RBF kernel. However, $\alpha$ balances between short and long-term variations in the function. If $\alpha$ is small, the kernel allows for more non-smooth and non-linear behavior. We can observe how the largest kernel values are along the diagonal highlighting how this kernel emphasizes the local irregularities.
If we assign a much smaller value to $\alpha$, we can see below how these irregularities are allowed to "propagate" further where the distance between points in a pair is larger.
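The scale-mixture interpretation can be checked numerically: as $\alpha$ grows, the RationalQuadratic approaches an RBF with the same length scale, while a small $\alpha$ gives heavier tails (parameter values below are arbitrary):

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

X = np.linspace(0, 5, 20).reshape(-1, 1)

K_rbf = RBF(length_scale=1.0)(X)
K_rq_small = RationalQuadratic(length_scale=1.0, alpha=0.1)(X)
K_rq_large = RationalQuadratic(length_scale=1.0, alpha=1e5)(X)

# large alpha: practically indistinguishable from the RBF kernel
print(np.max(np.abs(K_rq_large - K_rbf)))
# small alpha: heavier tails, larger off-diagonal values than the RBF
print(np.max(K_rq_small - K_rbf) > 0)  # True
```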

The last kernel we want to discuss here is the ExpSineSquared kernel, whose function is:

$k(x_i, x_j) = \text{exp}\left(- \frac{ 2\sin^2(\pi d(x_i, x_j) / p) }{ l^2 } \right)$.

The additional parameter $p$ is the periodicity parameter. The periodical component alone for our exercise with the Mauna Loa dataset is shown below.

As we can see, the period of 1 is markedly visible in the contour plot. The length scale determines how many neighboring points influence each other (widths of the periodic bands).
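We can verify the periodicity numerically: two points separated by an exact multiple of $p$ have kernel value 1, just like identical points, while a half-period shift gives a much lower value.

```python
import numpy as np
from sklearn.gaussian_process.kernels import ExpSineSquared

k = ExpSineSquared(length_scale=1.0, periodicity=1.0)
x = np.array([[0.3]])

same = k(x, x)[0, 0]               # identical points -> 1
one_period = k(x, x + 1.0)[0, 0]   # shifted by one full period -> 1
half_period = k(x, x + 0.5)[0, 0]  # shifted by half a period -> exp(-2)
print(same, one_period, half_period)
```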

Combining Base Kernels

For the Mauna Loa dataset, we used a decay time of ~90 years on top of the periodic component. This was an example of kernel composition.

As we can see, the pair of points indicated with the black dot at (50, 450) is in a slightly darker area than the black dot for the pair at (250, 300). This indicates that there is less "similarity" between data points with greater distances. The result is a periodic component with a decaying amplitude.

Multiplying kernels is like an AND operation. In fact, two data points are considered similar only if both kernels have high values. Similarly, a sum of kernels can be thought of as an OR operation: two data points are considered similar if either kernel has a high value. We show a sum of kernels below.
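This AND/OR intuition can be checked numerically; below, a slowly decaying RBF is combined with a periodic ExpSineSquared (the kernel choices and parameter values are illustrative):

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

decay = RBF(length_scale=50.0)                            # slow decay with distance
periodic = ExpSineSquared(length_scale=1.0, periodicity=1.0)

product = decay * periodic   # AND: both kernels must be high
total = decay + periodic     # OR: either kernel can be high

x = np.array([[0.0]])
far = np.array([[100.0]])    # 100 periods away

# Far apart: the periodic value is high (exact multiple of the period),
# but the decayed RBF value is low, so the product (AND) is low ...
print(product(x, far)[0, 0])
# ... while the sum (OR) stays high thanks to the periodic term alone.
print(total(x, far)[0, 0])
```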

Finding An Optimal Combination of Kernels

We build an optimized kernel polynomial whose terms are products of up to $M$ kernel functions selected from a list of $N$ available base kernel functions

$P(x) = \epsilon(x) + \sum_{\{K_1, K_2, \ldots, K_M\} \subseteq \{K_1, \ldots, K_N\}} a_{K_1, K_2, \ldots, K_M} K_1(x) K_2(x) \cdots K_M(x)$,

where $a_{K_1, K_2, \ldots, K_M}$ represents the ConstantKernel used as the coefficient of the polynomial term containing $K_1, K_2, \ldots, K_M$, and $\epsilon(x)$ represents the noise kernel function.

For example, we can consider the following list of base kernels (available in scikit-learn) to compose the kernel polynomial.

For simplicity, we show below an example of kernel composition using an initial set of 5 base kernels, labeled A, B, C, D, E.

We can require our kernel polynomial to have interaction terms such as $A*B$ or $A*B*C$ up to a given degree. For example, we can enforce interaction terms of degree at most 2. In this case, the kernel polynomial will not contain terms such as $A*B*C$, but it will contain terms such as $A$, $A*B$, etc.

We can also limit the number of polynomial terms. For example, we can allow for polynomials with up to 3 terms. In this case, polynomials such as $A + A*B$ and $A + A*B + A*C$ will be available but polynomials such as $A + A*B + A*C + A*E$ will not be available.
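Under these two constraints (products of degree at most 2, at most 3 terms), the candidate space over the bases A–E can be enumerated explicitly; the helper below is an illustration, not the notebook's actual implementation:

```python
from itertools import combinations

bases = ["A", "B", "C", "D", "E"]
max_degree, max_terms = 2, 3

# all product terms up to the maximum degree: A, B, ..., A*B, A*C, ...
terms = ["*".join(c) for d in range(1, max_degree + 1)
         for c in combinations(bases, d)]

# all polynomials with up to the maximum number of distinct terms
polynomials = [" + ".join(p) for t in range(1, max_terms + 1)
               for p in combinations(terms, t)]

print(len(terms), len(polynomials))  # 15 575
```

With 5 bases there are $5 + \binom{5}{2} = 15$ candidate terms and $\binom{15}{1} + \binom{15}{2} + \binom{15}{3} = 575$ candidate polynomials, which shows how quickly the search space grows.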

We can select the kernel polynomial by index.

In the example above, we let interc_coeff='k' be the coefficient for all terms. This will be replaced with a ConstantKernel whose parameter value will be optimized. Also, interc_coeff here represents the independent additional error term that can be added to the kernel, typically WhiteNoise() + RBF().

The search for the optimal kernel polynomial is typically a non-convex optimization problem with many local minima, i.e. many locally optimal kernels. In fact, many models that fit a finite dataset equally well can be found. This can be mitigated by reducing the kernel space to search, based on existing knowledge of the process that generated the data.

We will use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm, an evolutionary strategy (stochastic, derivative-free) that is available in Optuna as optuna.samplers.CmaEsSampler. Note that this sampler does not support categorical distributions. For those situations, we would opt for the TPESampler (Tree-structured Parzen Estimator). The latter optimization approach is described in detail in Bergstra, James S., et al., “Algorithms for Hyper-Parameter Optimization”, Advances in Neural Information Processing Systems, 2011, and Bergstra, James, Daniel Yamins, and David Cox, “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures”, Proceedings of the 30th International Conference on Machine Learning, 2013.

The search is set to explore 1000 trials with an early stopping if no improvement is found in 200 trials. The objective function builds and trains a modeling pipeline that preprocesses the data (scaling) and fits a Gaussian process regression model using the kernel of the trial. For smaller datasets (50 or fewer data rows), minimizing the Bayesian Information Criterion (BIC) drives the search for the optimal polynomial kernel. BIC is defined as

$BIC = k\log(n) - 2\log(\hat{L})$

where $\hat{L}$ is the maximized likelihood of the Gaussian process regression model (so $\log(\hat{L})$ is its log marginal likelihood), $n$ is the number of data rows, and $k$ is the number of parameters of the kernel function. The hyperparameter values of the kernel are optimized when fitting the Gaussian process regressor by maximizing the log-marginal-likelihood (LML) as described here.
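With scikit-learn, both BIC ingredients are directly available from a fitted regressor; a small sketch on synthetic data (the kernel here is arbitrary):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

n = len(X)
k = gpr.kernel_.n_dims                    # number of kernel hyperparameters
lml = gpr.log_marginal_likelihood_value_  # log(L_hat) after fitting
bic = k * np.log(n) - 2 * lml
print(k, bic)
```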

For larger datasets, the approach described above would lead to impractical computational times. Therefore, we stratify the dataset based on the target variable using KBinsDiscretizer. The target values are binned so that all bins contain the same number of data rows (quantile strategy). The stratification uses the bin IDs. At each trial, 50 data rows are sampled using this stratification of the target and are used to train the Gaussian process regressor while optimizing its kernel hyperparameter values. Different data samples are utilized at each trial. If a kernel was already tried earlier in the search, it is initialized with the hyperparameter values found at that time. The whole dataset is then used only to fit the model, without further optimizing its kernel hyperparameter values. The search for the optimal polynomial kernel is driven by minimizing the BIC calculated using the LML from this fit. A fundamental challenge with Gaussian processes is scalability: the cost of inverting the covariance matrix grows cubically with the size of the data.
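The stratified subsampling can be sketched as follows, using a synthetic stand-in for the target (the bin and sample counts follow the text; the exact implementation in the notebook may differ):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.linspace(1958, 2001, 500).reshape(-1, 1)
y = 315 + 1.5 * (X.ravel() - 1958) + rng.normal(0, 1, 500)  # synthetic target

# bin the target so every bin holds the same number of rows (quantile strategy)
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bin_ids = binner.fit_transform(y.reshape(-1, 1)).ravel()

# draw 50 rows stratified on the bin IDs
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=50, stratify=bin_ids, random_state=0)
print(X_sub.shape, y_sub.shape)  # (50, 1) (50,)
```

Because the split is stratified on the bin IDs, each of the 10 target bins contributes roughly 5 of the 50 sampled rows.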

To test this approach, we let Optuna find an optimal polynomial with no more than 2 base kernels in product terms and no more than 3 total terms plus the additional noise term. This will allow us to see how the result compares to the kernel used in the study described in the previous section.

Running the search multiple times may result in slightly different kernel polynomials since this optimization problem is not convex. Therefore, we should always look at the top trials and decide which kernel to adopt based on our intuition of the problem.

The kernel with the lowest BIC value is the following:

We can also look at other optimal kernels sorted by increasing $BIC$ values.

Finally, we can show the different components of the optimized kernel to identify their "signatures". As we can see below, the dominant component is

37.5**2 * RBF(length_scale=4.23) * ExpSineSquared(length_scale=7.14, periodicity=0.0798)

As before, we can show the fit and the forecast lines.

Copyright (c) [2023] [Alessio Tamburro]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MIT license